CapERA: Captioning Events in Aerial Videos

Abstract

In this paper, we introduce the CapERA dataset, which upgrades the Event Recognition in Aerial Videos (ERA) dataset to aerial video captioning. The newly proposed dataset aims to advance visual–language-understanding tasks for UAV videos by providing each video with diverse textual descriptions. To build the dataset, 2864 aerial videos are manually annotated with a caption that includes information such as the main event, object, place, action, numbers, and time. More captions are automatically generated from the manual annotation to take into account, as much as possible, the variation in describing the same video. Furthermore, we propose a captioning model for the CapERA dataset to provide benchmark results for aerial video captioning. The proposed model is based on the encoder–decoder paradigm with two configurations to encode the video. The first configuration encodes the video frames independently with an image encoder. Then, a temporal attention module is added on top to consider the temporal dynamics between the features derived from the video frames. In the second configuration, we directly encode the input video using a video encoder that employs factorized space–time attention to capture the dependencies within and between the frames. For generating captions, a language decoder is utilized to autoregressively produce the captions from the visual tokens. The experimental results under different evaluation criteria show the challenges of generating captions from aerial videos. We expect that the introduction of CapERA will open interesting new research avenues for integrating natural language processing (NLP) with UAV video understanding.
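As a rough illustration of the first configuration described in the abstract (per-frame features followed by temporal attention), the sketch below applies plain scaled dot-product self-attention across the frames of one video. It is a minimal sketch and not the authors' implementation: the function names, the toy feature values, and the use of raw features as queries, keys, and values (instead of learned projections) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frame_features):
    """Let every frame feature attend to all frames of the same video.

    frame_features: list of T vectors (one per frame), each of length D.
    Returns T vectors of length D. Queries, keys, and values are the raw
    features here (an assumption); a trained model would use learned
    projections instead.
    """
    dim = len(frame_features[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frame_features:
        # Similarity of this frame to every frame in the clip.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in frame_features
        ]
        weights = softmax(scores)
        # Weighted average of all frame features.
        attended.append([
            sum(w * value[d] for w, value in zip(weights, frame_features))
            for d in range(dim)
        ])
    return attended


# Hypothetical per-frame features: 3 frames, 2 dimensions each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_tokens = temporal_self_attention(features)
```

In a pipeline of the kind the abstract describes, the resulting `video_tokens` would play the role of the visual tokens passed to the language decoder. The factorized space–time attention of the second configuration is not shown here.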

Similar resources

Multi Target Tracking on Aerial Videos

In this paper, we propose a new method to detect and track multiple moving targets in image sequences recorded by Unmanned Aerial Vehicles (UAVs). Our approach focuses on challenging urban scenarios, where several objects are simultaneously moving in various directions, and we must expect frequent occlusions caused by other moving vehicles or static scene objects such as buildings and bridges. In...

The Effects of Captioning Videos on Academic Achievement and Motivation: Reconsideration of Redundancy Principle in Instructional Videos

The purpose of the present study was to investigate the effect of captioned vs. non-captioned instructional videos on motivation and achievement. To this end, a pre-test and post-test experimental design was used with 109 sophomores from a Turkish state university. Videos with and without captions of the unit in question were prepared by the lecturer of the course “Graphics and Animation in E...

Decoding Hazardous Events in Driving Videos

Decoding the human brain state with BCI methods can be seen as a building block for human-machine interaction, providing a noisy but objective, low-latency information channel including human reactions to the environment. Specifically in the context of autonomous driving, human judgement is relevant in high-level scene understanding. Despite advances in computer vision and scene understanding, i...

Automatic Recognition of Unpredictable Events in Videos

First, we describe what we mean by “predictability” of frames in an image stream. If frames are predictable, they are not as important as the ones that are unpredictable. We will rank these frames lower, since they can be inferred from the previous frames. For example, imagine the following situation. A person enters a camera view field, walks from left to right in front of the camera,...

Journal

Journal title: Remote Sensing

Year: 2023

ISSN: 2315-4632, 2315-4675

DOI: https://doi.org/10.3390/rs15082139